POLI 572B

Michael Weaver

January 28, 2019

Least Squares: bivariate regression

Plan for Today

Mechanics of Least Squares

  • Linear regression = “Ordinary Least Squares” = “OLS”
  • Understand it as an algorithm
  • Why “least squares”
  • We will see that it generalizes the mean

Along the way:

  • New interpretation of the mean
  • Some very basic matrix algebra

Why least squares?

  • A “simple” tool that is flexible to many applications
  • Ubiquitous, performs pretty well
    • often “good enough”

Eventually, we will apply it to causality

  • it evaluates the mean of some variable \(Y\) across different values of \(X\).
  • if we make some assumptions (recall our analysis of experiments), we can interpret this causally
  • it permits us to address spurious correlations
  • with some bigger assumptions, it can approximate “random” exposure after conditioning

For Today:

leaving behind causality for the moment

  • nothing today is about causality, proving causality, or making assumptions about causality.

leaving behind statistical models

  • we are not making a model to estimate a parameter with some level of uncertainty (this is not about statistics)

Just the mathematical properties of regression/least squares

  • least squares is just an algorithm with certain mathematical properties, nothing more, nothing less.
  • without making further assumptions, we can generate a line but the interpretation is limited.

Basic Concepts

Covariance and Correlation

How are two variables associated?

Covariance and Correlation

How are these variables associated?

Covariance and Correlation

How are these variables associated?

Covariance and Correlation

correlation: allows us to describe the association numerically.

Computation

Covariance

\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]

\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]

  • Divide by \(n-1\) for sample covariance.
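The two covariance formulas above can be checked against each other. A minimal Python sketch with made-up numbers (not the tree data used later in these slides):

```python
# Population covariance, computed two equivalent ways (made-up data).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
n = len(x)

def mean(v):
    return sum(v) / len(v)

# Definition: average product of deviations from the means
cov_dev = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y)) / n

# Shortcut: mean of the products minus the product of the means
cov_short = mean([xi * yi for xi, yi in zip(x, y)]) - mean(x) * mean(y)
```

Multiplying `cov_dev` by \(n/(n-1)\) would give the sample covariance.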

Computation

Variance

Variance is also the covariance of a variable with itself:

\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]

\[Var(X) = \overline{x^2} - \bar{x}^2\]
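The same check works for the variance shortcut; a quick Python sketch (again with made-up numbers):

```python
# Variance as the average squared deviation vs. the shortcut formula (made-up data).
x = [2.0, 4.0, 4.0, 6.0]
n = len(x)
xbar = sum(x) / n

var_dev = sum((xi - xbar) ** 2 for xi in x) / n        # definition
var_short = sum(xi * xi for xi in x) / n - xbar ** 2   # mean of squares minus squared mean
```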

Computation

Covariance of tree width and tree height:

## [1] 10.38333

Covariance of tree width and timber volume:

## [1] 49.88812

Computation

Why is the second one larger?

Computation

Why is the second one larger?

Computation

Scale of covariance reflects scale of the variables.

  • Can’t directly compare the two covariances

Covariance: Intuition

Computation

Covariance

\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

Pearson Correlation

\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)

  • Dividing by product of standard deviations scales the covariance
  • \(|Cov(X,Y)| \le \sqrt{Var(X) \cdot Var(Y)}\)
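Dividing the covariance by the product of standard deviations can be seen in a minimal Python sketch (the data here are made up for illustration):

```python
import math

# Pearson correlation: covariance scaled by the standard deviations (made-up data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
sd_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / n)
sd_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / n)

r = cov / (sd_x * sd_y)
# Because |Cov| can never exceed sqrt(Var(X) * Var(Y)), r always lies in [-1, 1]
```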

Properties

  • Correlation coefficient must be between -1, 1

  • At -1 or 1, the points are on a single line (with a fixed slope)

  • Negative value implies an increase in one variable is associated with a decrease in the other.

  • At 0, the covariance must be 0

Interpretation

  • Correlation of \((x,y)\) is same as correlation of \((y,x)\)

  • Generally, values closer to -1 or 1 imply “stronger” association

Interpretation

But:

  • Correlations cannot be understood using ratios.
  • Correlation of 0.8 is not “twice” as correlated as 0.4.
  • Correlation may poorly represent association in presence of outliers or nonlinearities.
  • See this in a second
  • Correlation is not causation

Interpretation

Which of these has the greatest association? The least?

The Mean: Revisited

Squared Deviations

Why are we always squaring differences?

  • Variance
  • Covariance
  • Mean Squared Error

It is about distance

Squared Deviations

What is the distance between two points?

Squared Deviations

What is the distance between two points?

Squared Deviations

What is the distance between two points?

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]

\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_n - q_n)^2}\]

Remember Pythagoras?
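The distance formula above is easy to write as a function; a short Python sketch:

```python
import math

def distance(p, q):
    """Euclidean distance between two points p and q of the same dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

d2 = distance((0, 0), (3, 4))        # Pythagoras: the 3-4-5 right triangle
d3 = distance((0, 0, 0), (1, 2, 2))  # the same formula works in any number of dimensions
```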

The Mean

The mean minimizes the variance.

  • we saw this before (why we divide by \(n-1\) when estimating the population variance)
  • this is the same as minimizing the distance, because variance is mathematically linked to the distance calculation

Deriving the mean:

Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.

What is a vector?

  • a vector is a one-dimensional array of numbers. The vector \(y\) below is \(2 \times 1\): \(2\) rows and \(1\) column.

\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]

  • a matrix is a rectangular array of numbers. \(X\) below is a \(3 \times 2\) matrix: \(3\) rows and \(2\) columns.

\[ X = \begin{pmatrix} 1 & -1 \\ 1 & 2 \\ 1 & 4 \end{pmatrix}\]

Vectors have geometric interpretation

Vectors Can be Added

Matrices and vectors of the same dimensions can be added or subtracted element-by-element.

For example:

\[\begin{pmatrix}1 \\ 1 \end{pmatrix} + \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix}-1 \\ 4 \end{pmatrix}\]

Vectors Can be Added

Vectors Can be Multiplied

Vectors are matrices with a single column (or row). Scalars are \(1 \times 1\) matrices.

  • Matrices/vectors can be multiplied (element-by-element) by a scalar.

\[2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}\]

\[0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\]

Vectors Can be Multiplied

\(a = 2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}; \ b = 0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\)

Matrices can be Multiplied

Matrices must have dimensions that match:

  • \(m \times n\) matrix \(A\) can be multiplied by matrix \(B\) if \(B\) is \(n \times p\).
  • Matrix \(AB\) is \(m \times p\)
  • Rows of \(A\) multiplied with columns of \(B\) and then summed:

Matrices can be Multiplied

\[\begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & -1 \\ 1 & 2 \end{pmatrix} =\]

\[\begin{pmatrix} (1 \cdot 1)+(1\cdot1) & (1\cdot-1) + (1 \cdot 2) \\ (-1\cdot1) + (2\cdot1) & (-1\cdot-1) + (2\cdot2) \end{pmatrix} = \]

\[ = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}\]
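The row-times-column rule can be written as a short function; this Python sketch reproduces the worked example above:

```python
def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B:
    each entry is a row of A multiplied with a column of B, then summed."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 1], [-1, 2]]
B = [[1, -1], [1, 2]]
C = matmul(A, B)  # same product as the slide above
```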

Inner Products

If \(u\) and \(v\) are \(n \times 1\) vectors, the inner product (or dot product) is \(u \bullet v = u' \times v\), where \(u'\) is the transpose of \(u\).

  • E.g.:

\[u = \begin{pmatrix} 1 \\ -2 \end{pmatrix}; \ v =\begin{pmatrix} 4 \\ 2 \end{pmatrix} \]

\[u \bullet v = \begin{pmatrix} 1 & -2 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = (1 \cdot 4) + (-2 \cdot 2) = 0\]

  • When the inner product is \(0\), then \(u\) and \(v\) are orthogonal: \(u \perp v\)
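The example above, as a minimal Python sketch:

```python
def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

u = [1, -2]
v = [4, 2]
product = dot(u, v)  # (1)(4) + (-2)(2); a result of 0 means u and v are orthogonal
```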

Orthogonal Vectors

Deriving the mean:

Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.

\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]

Deriving the mean:

Deriving the mean:

We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values of our vector \(y\).

Within the \(n\) dimensional space containing our vector \(y\), we want to pick the best prediction of \(y\) (which is in \(n\) dimensions) that is in one dimension: find a single point on a line.

Projecting this into the same dimensional space as \(y\) produces:

\(\hat{y} \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix}\hat{y} \\ \hat{y} \end{pmatrix}\).

\(\hat{y}\) must be on the blue line. We want to pick the point (prediction) that minimizes the distance to \(y\).

Deriving the mean:

\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\)

can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):

\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)

and another vector \(\mathbf{e}\), which is the difference between the prediction vector and the vector of observations:

\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)

Deriving the mean:

Deriving the mean:

This means our goal is to minimize the length of \(\mathbf{e}\).

How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:

\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]

The minimum length is obtained when the angle between \(\hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\) and \(\mathbf{e}\) is \(90^{\circ}\). That is to say

\(\begin{pmatrix} 1 \\ 1 \end{pmatrix} \perp \mathbf{e}\).

Deriving the mean:

We know that two vectors are orthogonal (\(\perp\)) when their dot product is \(0\), so we can create the following equality and solve for \(\hat{y}\).

\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix}3 & 5 \end{pmatrix} - \begin{pmatrix} \hat{y} & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix}3 & 5 \end{pmatrix} - \hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

Deriving the mean:

\((\begin{pmatrix} 3 & 5 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) - (\hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) = 0\)

\(8 - 2\hat{y} = 0\)

\(8 = 2\hat{y}\)

\(\hat{y} = 4\)
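This solution can be checked numerically; a Python sketch verifying that \(\hat{y} = 4\) makes the residual orthogonal to \((1,1)\) and beats any other candidate prediction:

```python
import math

y = [3.0, 5.0]
ones = [1.0, 1.0]

# The orthogonality condition e . (1, 1) = 0 solves to y_hat = 4, the mean
y_hat = sum(y) / len(y)
e = [yi - y_hat for yi in y]
assert sum(ei * oi for ei, oi in zip(e, ones)) == 0  # e is orthogonal to (1, 1)

def length(v):
    return math.sqrt(sum(vi ** 2 for vi in v))

# Any other candidate prediction leaves a longer residual vector
for candidate in [3.0, 3.9, 4.1, 5.0]:
    assert length([yi - candidate for yi in y]) > length(e)
```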

Deriving the mean:

More generally:

\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \begin{pmatrix} \hat{y} & \ldots & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \hat{y}\begin{pmatrix} 1 & \ldots & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

More generally:

\((\sum\limits_{i=1}^{n} y_i\cdot1) - \hat{y} \sum\limits_{i=1}^{n} 1 = 0\)

\(\sum\limits_{i=1}^{n} y_i = \hat{y} n\)

\(\frac{1}{n}\sum\limits_{i=1}^{n} y_i = \hat{y}\)

Conditional Expectation Function

Generalizing the Mean:

The mean is useful…

… but often we want to know if the mean of something \(Y\) is different across different values of something else \(X\).

To put it another way: the mean of \(Y\) is \(E(Y)\) (if we are talking about random variables). Sometimes we want to know \(E(Y | X)\).

  • simplest version could be difference in means (experiments)

Generalizing the Mean:

So we are interested in finding some conditional expectation function (Angrist and Pischke)

expectation: because it is about the mean of \(Y\)

conditional: because it is conditioned on different values of \(X\).

function: because \(E(Y) = f(X)\), there is some relationship we can look at between values of \(X\) and \(E(Y)\).

Generalizing the Mean:

There are many ways to get the conditional expectation function

One powerful and simple way is to assume that the conditional expectation function is linear.

That is to say \(E(Y)\) is linear in \(X\). The function takes the form of an equation of a line.

Equation of a line

Equation of a line

Equation of a line

\(slope = \frac{rise}{run} = \frac{y_2-y_1}{x_2-x_1}\)
  • Change in \(y\) with a 1 unit change in \(x\).

Equation of a line

Equation of a line

\(intercept = (y | x=0)\)
  • Value of \(y\) when \(x = 0\). Where the line crosses the \(y\)-axis.

Equation of a line

\(y = intercept + slope \cdot x\)

or, by convention:

\(y = a + bx\)

Generalizing the Mean:

How do we choose the line that best captures:

\[E(Y) = a + b\cdot X\]

What line fits this?

Which line?

Which line?

Which line?

Graph of Averages

Which line?

The red line above is the regression line or the fit using least squares.

It closely approximates the conditional mean of son’s height (\(Y\)) across values of father’s height (\(X\)).

How do we obtain this line mathematically?

We can do it the same way we obtained the mean!

Deriving Least Squares

Regression works similarly:

Rather than reduce the \(n \times 1\) vector \(\mathbf{Y}\) to one dimension (as we did with the mean), we reduce it to \(p\) (the number of parameters) dimensions. This requires more dimensions than can easily be visualized, but we still end up minimizing the distance between the \(n\)-dimensional vector \(\mathbf{\hat{Y}}\) and the vector \(\mathbf{Y}\).

Deriving Least Squares

Given \(\mathbf{Y}\), an \(n \times 1\) dimensional vector of all values of dependent variables \(Y\) for \(n\) observations

and \(\mathbf{X}\), an \(n \times p\) dimensional matrix (\(p\) independent variables, including an intercept, \(n\) observations)

\(\mathbf{\hat{Y}}\) is an \(n \times 1\) dimensional vector of predicted values (for the mean of Y conditional on X) computed by \(\mathbf{X\beta}\). \(\mathbf{\beta}\) is a \(p \times 1\) vector of parameters that we multiply by \(\mathbf{X}\).

Today we’ll assume there are only two parameters in \(\mathbf{\beta}\): \(a,b\) from \(Y_i = a + b \cdot X_i\), so \(p = 2\)

Deriving Least Squares

We want to choose \(\mathbf{\beta}\) (i.e., \(a,b\)) such that the distance between \(\mathbf{Y}\) and \(\mathbf{\hat{Y}}\) is minimized; equivalently, such that the sum of squared residuals is minimized (these are identical conditions).

Like before, the distance is minimized when the vector of residuals \(\mathbf{Y} - \mathbf{\hat{Y}} = \mathbf{e}\) is \(\perp\) to \(\mathbf{X}\)

Residuals

Deriving Least Squares

\(\mathbf{X}'_{p\times n}\mathbf{e}_{n\times1} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}\)

\(\mathbf{X}'(\mathbf{Y} - \mathbf{\hat{Y}}) = 0\)

\(\mathbf{X}'(\mathbf{Y} - \mathbf{X{\beta}}) = 0\)

\(\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X{\beta}}\)

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

Note!

here \(\mathbf{X{\beta}} = a + b \cdot X\)

Deriving Least Squares

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\]

This is the matrix formula for least squares regression. If \(X\) is a column vector of 1s, \(\beta\) is just the mean of \(Y\). This is in fact identical to what we solved for earlier.
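The special case just mentioned (a lone column of 1s recovers the mean) can be checked directly; a Python sketch with made-up numbers:

```python
# If X is a single column of 1s, then X'X = n and X'Y = sum(y),
# so the least-squares formula collapses to the mean of Y (made-up data).
y = [3.0, 5.0, 7.0]
n = len(y)

xtx = n          # X'X: the 1s column dotted with itself
xty = sum(y)     # X'Y: the 1s column dotted with y
beta = xty / xtx # (X'X)^{-1} X'Y, here just sum(y) / n
```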

But we also want to know more intuitively what these matrix operations are doing! It isn’t magic.

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}'\mathbf{X} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\]

\[= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}'\mathbf{Y} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\] \[= \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} = n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]

Inverting Matrices

How do we get \(^{-1}\)? This is inverting a matrix.

  • Inverse \(A^{-1}\) of matrix \(A\) that is square or \(p\times p\) has the property:

\[A \times A^{-1} = A^{-1} \times A = I_{p \times p} = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & \ddots & \ldots & 0 \\ 0 & \ldots & \ddots & 0 \\ 0 & \ldots & 0 & 1 \end{pmatrix}\]

This is an identity matrix with 1s on diagonal, 0s everywhere else.

Inverting Matrices

We need to get the determinant

For sake of ease, will show for a scalar and for a \(2 \times 2\) matrix:

\[det(a) = a\]

\[det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc\]

Inverting Matrices

Then we need to get the adjoint. It is the transpose of the matrix of cofactors (don’t ask me why):

\[adj(a) = 1\]

\[adj\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]

Inverting Matrices

The inverse of \(A\) is \(adj(A)/det(A)\)

\[A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
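The \(2 \times 2\) inverse formula is short enough to code and verify against the identity matrix; a Python sketch:

```python
def inverse_2x2(A):
    """Invert a 2x2 matrix [[a, b], [c, d]] as adjoint over determinant."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1.0, 1.0], [-1.0, 2.0]]
A_inv = inverse_2x2(A)

# Multiplying A by its inverse should recover the 2x2 identity matrix
identity = [[sum(A[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```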

Deriving Least Squares

\[A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X}) = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{n}{n^2(\overline{x^2} - \overline{x}^2)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]

Deriving Least Squares

We can put it together to get: \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix} n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x}\ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

Deriving Least Squares

The slope:

\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

\[b = \frac{\overline{xy} - \overline{x} \ \overline{y}}{Var(x)} = \frac{Cov(x,y)}{Var(x)}\]

\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]
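Both expressions for the slope give the same number; a minimal Python sketch with made-up data:

```python
import math

# Slope two ways: Cov(x, y) / Var(x), and r * SD_y / SD_x (made-up data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

cov_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
var_x = sum((a - xbar) ** 2 for a in x) / n
sd_x = math.sqrt(var_x)
sd_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / n)
r = cov_xy / (sd_x * sd_y)

b1 = cov_xy / var_x    # slope as Cov(x, y) / Var(x)
b2 = r * sd_y / sd_x   # same slope via the correlation coefficient
```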

Deriving Least Squares

The slope:

  • Expresses how much mean of \(Y\) changes for a 1-unit change in \(X\)
  • When expressed as function of correlation coefficient \(r\), we see this rise (\(SD_y\)) over the run (\(SD_x\))

The Intercept:

\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

\[a = \frac{\overline{x^2}\overline{y} -\overline{x} \ \overline{xy}}{Var(x)} = \frac{(Var(x) + \overline{x}^2)\overline{y} - \overline{x}(Cov(x,y) + \overline{x}\overline{y})}{Var(x)}\]

\[= \frac{Var(x)\overline{y} + \overline{x}^2\overline{y} - \overline{x}^2\overline{y} - \overline{x}Cov(x,y)}{Var(x)}\]

\[= \overline{y} - \overline{x}\frac{Cov(x,y)}{Var(x)}\]

\[a = \overline{y} - \overline{x}\cdot b\]

Deriving Least Squares

The Intercept:

\[a = \overline{y} - \overline{x}\cdot b\]

Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.
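The point-of-averages fact can be verified numerically; a Python sketch with made-up data:

```python
# The fitted line a + b*x passes through the point of averages (made-up data).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Slope via the shortcut formulas: (mean(xy) - mean(x)mean(y)) / (mean(x^2) - mean(x)^2)
b = (sum(a * c for a, c in zip(x, y)) / n - xbar * ybar) \
    / (sum(a * a for a in x) / n - xbar ** 2)
a_hat = ybar - xbar * b   # the intercept

prediction_at_xbar = a_hat + b * xbar  # equals ybar: the line hits (xbar, ybar)
```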

Deriving Least Squares

There are other ways to derive least squares.

  • Because we eventually want multiple variables, we need to build from an intuition rooted in matrices and their relationship to distance.
  • Adding more variables puts us into minimizing distance in an \(n > 3\) dimensional space, so it gets weird.
  • Nevertheless, this exercise is helpful

Summary

Key facts about regression:

The mathematical procedures we use in regression ensure that:

  1. the mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(a\)), the regression line goes through the point of averages, so the residuals average to 0. This is also true of the residuals from the mean.

  2. \(Cov(X,e) = 0\). This is true by definition of how we derived least squares. We chose \(\beta\) (\(a,b\)) such that \(X'e = 0\) so they would be orthogonal. \(X'e = 0 \to \overline{xe}=0\); \(\overline{e}=0\) from above; so \(Cov(X,e) = \overline{xe}-\overline{x}\overline{e} = 0 - \overline{x}0 = 0\).
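Both facts can be confirmed on any fitted line; a Python sketch with made-up data:

```python
# Checking the two mechanical facts: mean(e) = 0 and Cov(x, e) = 0 (made-up data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) \
    / sum((a - xbar) ** 2 for a in x)
a_hat = ybar - xbar * b
e = [c - (a_hat + b * a) for a, c in zip(x, y)]  # residuals

mean_e = sum(e) / n                                             # fact 1: zero
cov_xe = sum(a * ei for a, ei in zip(x, e)) / n - xbar * mean_e # fact 2: zero
```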

Key facts about regression:

These two facts are unrelated to assumptions we will make later for statistical and causal inference. They are mathematical truths about regression/least squares.

Key ideas:

  • We can fit a regression line to any scatterplot (regardless of how sensical it is), if \(x\) has positive variance.

  • The regression line minimizes the sum of squared residuals (minimizes the distance between predicted values and actual values of \(y\)). This is why it is called “least squares”

  • Residuals \(e\) are always uncorrelated with \(x\) if there is an intercept, because they are orthogonal to \(x\) and have a mean of \(0\).

Key ideas:

  • We have not addressed in any way how this relates to a statistical model or a causal model.
  • Even as a summary exercise, the regression line might be misleading. If we want to summarize the conditional expectation function with a line, this works best when the linear approximation is about right (our example of heights)
  • If the conditional expectation function is non-linear (e.g. a “U” shape in the mean of \(y\) across values of \(x\)), a linear regression may not be informative
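The “U” shape warning is easy to see numerically: with data constructed so that \(y = x^2\) exactly, a Python sketch shows the least-squares slope is \(0\) even though the mean of \(y\) clearly varies with \(x\):

```python
# A "U"-shaped conditional expectation defeats the linear summary (made-up data).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]   # mean of y clearly depends on x...
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) \
    / sum((a - xbar) ** 2 for a in x)
# ...yet the least-squares slope b is 0: the fitted line is flat and uninformative
```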